Add agent experience testing skill, expand .claude #561
muhsinking wants to merge 12 commits into main from
Conversation
runpod-Henrik
left a comment
PR #561 — Add agent experience testing framework, expand .claude
No prior reviews found — this is a first-time review.
1. MCP tool name typo — .claude/testing.md
Issue: Line 413 references mcp__runpod-dops__search_runpod_documentation. The MCP server is registered as runpod-docs, so the tool name should be mcp__runpod-docs__search_runpod_documentation. The misspelling will cause all Published Docs mode test runs to fail — the tool won't resolve.
2. Test table format doesn't match documented spec
Issue: tests/README.md and .claude/testing.md both say each test has three fields: ID, Goal, and Cleanup. The actual tables in TESTS.md have columns ID | Goal | Difficulty — no Cleanup column. An agent reading a test definition won't know what resources to clean up from the table; it has to infer from the bottom section. Either update the spec to say cleanup rules are global (not per-test), or add the Cleanup column back to the tables.
3. Port limit accuracy — runpodctl-create-pod.mdx
Question: The original said "Maximum of 1 HTTP port and 1 TCP port allowed." The new text says "up to 10 HTTP ports and multiple TCP ports." Is that backed by actual runpodctl behavior? If the original limit still applies to the CLI (even if the REST API allows more), this would mislead users into configurations that fail.
4. Framework vs catalog
This is a well-conceived idea but the implementation is a catalog, not a framework. A few structural gaps will limit how useful it is in practice:
No automation layer. Tests are triggered by a human typing natural language to Claude Code. There's no runner, no CI hook, no batch mode. 85 tests that require manual one-by-one triggering will never get run systematically.
Results are ephemeral. tests/reports/ is gitignored. There's no history, no trend tracking, no way to know which tests consistently fail across doc changes.
No smoke test tier. Many tests require live GPU deploys. There's no defined fast subset (10–15 tests) suitable for running before every merge. Without that, the full suite is too expensive to run regularly.
No success criteria. Difficulty (Easy/Hard) isn't a pass condition. The agent decides what PASS means, which will be inconsistent across runs. A brief expected outcome per test (e.g. "endpoint responds with 200 to a /runsync request") would anchor the verdict.
No cleanup safety net. If a test crashes mid-run, doc_test_ resources are orphaned. A cleanup script (e.g. delete all resources matching doc_test_*) would prevent cost surprises.
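To make the cleanup-script suggestion concrete, here is a minimal sketch. It assumes runpodctl's get pod and remove pod subcommands and a whitespace-separated listing with the pod ID in the first column and the name in the second; the actual output format should be checked before relying on this:

```python
import subprocess
import sys

PREFIX = "doc_test_"

def orphaned(lines: list[str]) -> list[tuple[str, str]]:
    """Pick (id, name) pairs whose name carries the test prefix."""
    hits = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1].startswith(PREFIX):
            hits.append((parts[0], parts[1]))
    return hits

def main(delete: bool = False) -> None:
    # List pods, skip the header row, and act on anything test-prefixed.
    out = subprocess.run(
        ["runpodctl", "get", "pod"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pod_id, name in orphaned(out.splitlines()[1:]):
        if delete:
            subprocess.run(["runpodctl", "remove", "pod", pod_id], check=True)
            print(f"deleted {name} ({pod_id})")
        else:
            print(f"would delete {name} ({pod_id})  (dry run)")

if __name__ == "__main__":
    main(delete="--delete" in sys.argv)
```

Defaulting to dry-run keeps the script safe to run speculatively; only an explicit --delete flag destroys anything.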
The local-docs mode for pre-merge validation is the most immediately useful feature here. The published-docs batch testing vision is worth pursuing but needs the automation layer first.
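A machine-checkable expected outcome (point 4 above) could be as small as a helper that probes the deployed endpoint. This sketch assumes RunPod's serverless URL pattern (https://api.runpod.ai/v2/{endpoint_id}/runsync) and a bearer-token API key; the verdict helper is hypothetical and only anchors the PASS/FAIL vocabulary used in reports:

```python
import json
import urllib.error
import urllib.request

def runsync_ok(endpoint_id: str, api_key: str, payload: dict,
               timeout: int = 120) -> bool:
    """True when the endpoint answers a /runsync request with HTTP 200."""
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        data=json.dumps({"input": payload}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

def verdict(ok: bool) -> str:
    """Map a mechanical check onto the report's PASS/FAIL vocabulary."""
    return "PASS" if ok else "FAIL"
```

Even one such check per test would make verdicts reproducible across runs instead of leaving PASS to the agent's judgment.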
Nits
- .gitignore is missing a trailing newline.
- Double --- separator before the Cleanup Rules section in TESTS.md looks like a copy-paste artifact.
Verdict
NEEDS WORK — The MCP tool name typo (#1) silently breaks Published Docs mode. The Cleanup column mismatch (#2) creates ambiguity for running agents. The port limit change (#3) needs factual verification. Section 4 is not a blocker but worth discussing before the suite grows further.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
runpod-Henrik
left a comment
Delta review — since 2026-03-20 13:52
1. Issue resolved — MCP tool name typo
Fixed prior to last review (per improvement plan). Confirmed: .claude/testing.md and .claude/commands/test.md both use mcp__runpod-docs__search_runpod_documentation.
2. Issue resolved — Test table format mismatch
Option B adopted: cleanup rules are global (defined once at the bottom), not per-test. Tables now have ID | Goal | Expected Outcome, with a callout pointing to the global Cleanup Rules section. Clear and consistent.
3. Issue resolved — Port limit accuracy
Verified against pods/configuration/expose-ports.mdx (confirmed 10 HTTP ports). The new text is accurate.
4. Structural gaps — status update
| Gap | Status |
|---|---|
| No smoke test tier | Resolved — 12 smoke tests added (no GPU deploys) |
| No success criteria | Resolved — Difficulty column replaced with Expected Outcome across all tables |
| No cleanup safety net | Resolved — cleanup.py added with dry-run and --delete flags |
| Ephemeral results | Partially addressed — dual-location saving (tests/reports/ + ~/Dev/doc-tests/), stats.py for trend tracking |
| No automation layer | Still deferred — acceptable for now |
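For the trend-tracking half of this, aggregation can stay very small. A sketch of what stats.py might do, assuming reports are markdown files containing a line like "Verdict: PASS" (the real report format may differ):

```python
from collections import Counter
from pathlib import Path

def tally_verdicts(report_texts: list[str]) -> Counter:
    """Count verdict lines (e.g. 'Verdict: PASS') across report texts."""
    counts: Counter = Counter()
    for text in report_texts:
        for line in text.splitlines():
            if line.startswith("Verdict:"):
                counts[line.split(":", 1)[1].strip()] += 1
    return counts

def tally_dir(report_dir: str) -> Counter:
    """Aggregate every markdown report in a directory."""
    texts = [p.read_text() for p in sorted(Path(report_dir).glob("*.md"))]
    return tally_verdicts(texts)
```

Run against ~/Dev/doc-tests/ over time, a tally like this is enough to spot tests that consistently fail across doc changes.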
5. Nits resolved
.gitignore trailing newline and double --- separator in TESTS.md both fixed.
6. New: batch mode — minor confirmation step inconsistency
.claude/commands/test.md specifies step 2 of batch execution as "Show test list — ask for confirmation before running." .claude/testing.md's batch section omits this step and goes straight to "Run sequentially." An agent following testing.md will run a full category silently without confirmation. Worth aligning — the confirmation step in commands/test.md is the safer behaviour.
Nits
- Category test counts in commands/test.md (e.g., serverless | 20, pods | 11) will silently drift as tests are added. Either remove the count column or note it's approximate. Low impact, but misleading once the table goes stale.
- IMPROVEMENT_PLAN.md removed — correct, all items tracked there are done.
Verdict
PASS — all blockers from prior reviews are resolved. The batch mode addition is well-structured and consistent across the three files that needed updating. One minor behaviour gap in batch confirmation, but not a blocker.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
Agent experience testing skill
Summary
Adds a lightweight framework for testing documentation quality by having AI coding agents attempt real-world tasks using only the docs. Tests reveal documentation gaps by simulating what happens when a user asks "how do I deploy a vLLM endpoint?" without any prior context.
Philosophy
Tests are intentionally hard to pass. Each test is a single sentence—no hints, no steps, no doc references. If the docs are good, an agent can figure it out. If not, the test reveals exactly where users get stuck.
How it works
Tests are defined in tests/TESTS.md and run by asking Claude Code in natural language, e.g. "Run the vllm-deploy test".
Two doc source modes
"Run the vllm-deploy test" runs against the published docs; "Run the vllm-deploy test using local docs" switches to local mode, which reads .mdx files directly from the repo, letting you test doc changes on a branch before merging.
Test coverage
~85 tests across 13 product areas.
Test format
Tests are minimal by design: an ID, a one-sentence goal, and an expected outcome.
That's it. The agent must figure out everything else from the docs.
Report output
After each test, reports are saved to tests/reports/ (gitignored).
Files changed
Requirements
Safety
All resources created during tests use the doc_test_ prefix.